Lightweight Structured Text Processing

Authors

Rob Miller

Brad A. Myers

Abstract

Text is a popular storage and distribution format for information, partly due to generic text-processing tools like Unix grep and sort. Unfortunately, existing generic tools make assumptions about text format (e.g., each line is a record) that limit their applicability. Custom-built tools are one alternative, but they require substantial time investment and programming expertise. We describe a new approach, lightweight structured text processing, which overcomes these difficulties by enabling users to define text structure interactively and manipulate the structure with generic tools. Our prototype system, LAPIS, is a web browser that can highlight, filter, and sort text regions described by the user. LAPIS has several advantages over other systems: (1) the ability to define custom structure with a simple, intuitive pattern language; (2) interactive specification, showing pattern matches in context and letting users choose the most convenient combination of manual selection and pattern matching; and (3) external parsers for standard text formats. The pattern language in LAPIS, text constraints, describes text structure in high-level terms, with region relationships like before, after, in, and contains. We describe an implementation of text constraints using a novel, compact representation of region sets as collections of rectangles, or region intervals. We also illustrate some examples of applying LAPIS to web pages, text files, and source code.

Appeared in Proceedings of USENIX 1999 Annual Technical Conference, Monterey, CA, June 1999, pp. 131–144. Outstanding paper award.
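To make the region relationships named in the abstract (before, after, in, contains) more concrete, here is a minimal Python sketch. It is not the LAPIS implementation and does not use the paper's region-interval (rectangle) representation; it models each region as a plain pair of character offsets. The `Region` class, the `matches` and `select` helpers, the sample text, and the exact semantics chosen for each relation are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A span of text: start offset inclusive, end offset exclusive (hypothetical model)."""
    start: int
    end: int

    def is_in(self, other: "Region") -> bool:
        # self lies entirely within other
        return other.start <= self.start and self.end <= other.end

    def contains(self, other: "Region") -> bool:
        return other.is_in(self)

    def is_before(self, other: "Region") -> bool:
        # self ends at or before the point where other starts
        return self.end <= other.start

    def is_after(self, other: "Region") -> bool:
        return other.is_before(self)


def matches(pattern: str, text: str) -> list[Region]:
    """Return one Region per regular-expression match in text."""
    return [Region(m.start(), m.end()) for m in re.finditer(pattern, text)]


def select(candidates, relation, anchors):
    """Keep each candidate that stands in `relation` to at least one anchor."""
    return [c for c in candidates if any(relation(c, a) for a in anchors)]


def text_of(regions, text):
    return [text[r.start:r.end] for r in regions]


TEXT = "alice 30 boston\nbob 25 denver\ncarol 41 austin\n"
lines = matches(r"[^\n]+", TEXT)
numbers = matches(r"\d+", TEXT)
bob = matches(r"bob", TEXT)

bob_lines = select(lines, Region.contains, bob)      # lines containing "bob"
bob_ages = select(numbers, Region.is_in, bob_lines)  # numbers in those lines
later_lines = select(lines, Region.is_after, bob)    # lines after "bob"

print(text_of(bob_lines, TEXT))
print(text_of(bob_ages, TEXT))
print(text_of(later_lines, TEXT))
```

Composing `select` calls in this way loosely mirrors how a constraint such as "number in line containing bob" narrows a set of regions step by step. Comparing every candidate against every anchor is quadratic; the abstract states that LAPIS instead relies on a compact representation of region sets.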


Related Resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


On Lightweight Data Summaries for Optimised Query Processing over Linked Data

Typical approaches for querying structured Web Data collect (crawl) and pre-process (index) large amounts of data in a central data repository before allowing for query answering. This time-consuming pre-processing phase however leverages the benefits of Linked Data – where structured data is accessible live and up-to-date at distributed Web resources that may change constantly – only to a limi...


Effects of Structured Input and Meaningful Output on EFL Learners' Acquisition of Nominal Clauses

The current second language (L2) instruction research has raised great motivation for the use of both processing instruction and meaningful output instruction tasks in L2 classrooms as the two focus-on-form (FonF) instructional tasks. The present study investigated the effect of structured input tasks (represented by referential and affective tasks) compared with meaningful output tasks (implem...


Answering Definition Questions via Temporally-Anchored Text Snippets

A lightweight extraction method derives text snippets associated to dates from the Web. The snippets are organized dynamically into answers to definition questions. Experiments on standard test question sets show that temporally-anchored text snippets allow for efficiently answering definition questions at accuracy levels comparable to the best systems, without any need for complex lexical reso...


Assignment of ontology-based broad semantic classes to biomedical text

Natural language processing of biomedical text benefits from the ability to recognize broad semantic classes, but the number of semantic types is far bigger than is usually treated in newswire text. A method for broad semantic class assignment using lightweight linguistic analysis is described and evaluated using traditional and novel methods.




Publication date: 1999
